The objective is to build a model that, given a transaction, yields the probability that it is fraudulent. In addition, it is essential to analyze the data, search for profiles, produce the appropriate descriptive statistics, and perform variable selection and transformation.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import requests
import warnings
warnings.filterwarnings('ignore')
# visualization options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
# data read
df = pd.read_csv('../data/raw/dataset_payments_fraud.csv', sep = ';')
df.tail()
| step | type | amount | gender | device | connection_time | nameOrig | race | oldbalanceOrg | age | newbalanceOrig | zone | user_number | nameDest | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1048570 | 95 | CASH_OUT | 132557.35 | woman | mac | 0,420243195 | C1179511630 | black | 479803.00 | 39 | 347245.65 | NaN | 3179 | C435674507 | 9 | 0 | 484329.37 | 616886.72 | 0 |
| 1048571 | 95 | PAYMENT | 9917.36 | woman | mac | 0,37616924 | C1956161225 | black | 90545.00 | 36 | 80627.64 | africa | 2558 | M668364942 | 9 | 0 | 0.00 | 0.00 | 0 |
| 1048572 | 95 | PAYMENT | 14140.05 | unknow | pc | 0,837108874 | C2037964975 | black | 20545.00 | 31 | 6404.95 | africa | 1553 | M1355182933 | 2 | 0 | 0.00 | 0.00 | 0 |
| 1048573 | 95 | PAYMENT | 10020.05 | man | mac | 0,480931208 | C1633237354 | black | 90605.00 | 94 | 80584.95 | country | 642 | M1964992463 | 8 | 0 | 0.00 | 0.00 | 0 |
| 1048574 | 95 | PAYMENT | 11450.03 | man | pc | 0,868197534 | C1264356443 | black | 80584.95 | 84 | 69134.92 | country | 2081 | M677577406 | 4 | 0 | 0.00 | 0.00 | 0 |
In order to obtain the connection_time variable as numeric, we replace the "," with a ".". We will repeat this step during data preprocessing.
df = df.assign(**{'connection_time': lambda df: df['connection_time'].str.replace(',', '.').astype(float)})
df.tail()
| step | type | amount | gender | device | connection_time | nameOrig | race | oldbalanceOrg | age | newbalanceOrig | zone | user_number | nameDest | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1048570 | 95 | CASH_OUT | 132557.35 | woman | mac | 0.420243 | C1179511630 | black | 479803.00 | 39 | 347245.65 | NaN | 3179 | C435674507 | 9 | 0 | 484329.37 | 616886.72 | 0 |
| 1048571 | 95 | PAYMENT | 9917.36 | woman | mac | 0.376169 | C1956161225 | black | 90545.00 | 36 | 80627.64 | africa | 2558 | M668364942 | 9 | 0 | 0.00 | 0.00 | 0 |
| 1048572 | 95 | PAYMENT | 14140.05 | unknow | pc | 0.837109 | C2037964975 | black | 20545.00 | 31 | 6404.95 | africa | 1553 | M1355182933 | 2 | 0 | 0.00 | 0.00 | 0 |
| 1048573 | 95 | PAYMENT | 10020.05 | man | mac | 0.480931 | C1633237354 | black | 90605.00 | 94 | 80584.95 | country | 642 | M1964992463 | 8 | 0 | 0.00 | 0.00 | 0 |
| 1048574 | 95 | PAYMENT | 11450.03 | man | pc | 0.868198 | C1264356443 | black | 80584.95 | 84 | 69134.92 | country | 2081 | M677577406 | 4 | 0 | 0.00 | 0.00 | 0 |
We take the target variable, which indicates whether a transaction is fraudulent.
plot_df = df['isFraud']\
.value_counts(normalize = True)\
.mul(100).rename('percent').reset_index()
plot_df_conteo = df['isFraud'].value_counts().reset_index()
plot_df_pc = pd.merge(plot_df, plot_df_conteo,
on = ['index'], how = 'inner')
plot_df_pc
| index | percent | isFraud | |
|---|---|---|---|
| 0 | 0 | 99.89109 | 1047433 |
| 1 | 1 | 0.10891 | 1142 |
fig = px.histogram(plot_df_pc, x = "index", y = ['percent'])
fig.show()
We observe that the target variable is highly imbalanced since, as expected, only a small percentage of transactions are fraudulent. In this case only 0.109%.
df.dtypes
step                  int64
type                 object
amount              float64
gender               object
device               object
connection_time     float64
nameOrig             object
race                 object
oldbalanceOrg       float64
age                   int64
newbalanceOrig      float64
zone                 object
user_number           int64
nameDest             object
user_connections      int64
security_alert        int64
oldbalanceDest      float64
newbalanceDest      float64
isFraud               int64
dtype: object
df.duplicated().sum()
0
There are no duplicates in the dataset.
df_null_columns = df.isnull().sum().sort_values(ascending=False)
df_null_rows = df.isnull().sum(axis=1).sort_values(ascending=False)
null_columns = pd.DataFrame(df_null_columns, columns=['missing_values'])
null_rows = pd.DataFrame(df_null_rows, columns=['null_rows'])
null_rows['target'] = df['isFraud'].copy()
null_columns['porcentaje'] = null_columns['missing_values']/df.shape[0]*100
null_rows['rows_percentage'] = null_rows['null_rows']/df.shape[1]*100  # percentage of null cells per row
null_columns
| missing_values | porcentaje | |
|---|---|---|
| race | 105163 | 10.029135 |
| gender | 105118 | 10.024843 |
| device | 104580 | 9.973536 |
| zone | 104414 | 9.957705 |
| step | 0 | 0.000000 |
| newbalanceDest | 0 | 0.000000 |
| oldbalanceDest | 0 | 0.000000 |
| security_alert | 0 | 0.000000 |
| user_connections | 0 | 0.000000 |
| nameDest | 0 | 0.000000 |
| user_number | 0 | 0.000000 |
| age | 0 | 0.000000 |
| newbalanceOrig | 0 | 0.000000 |
| type | 0 | 0.000000 |
| oldbalanceOrg | 0 | 0.000000 |
| nameOrig | 0 | 0.000000 |
| connection_time | 0 | 0.000000 |
| amount | 0 | 0.000000 |
| isFraud | 0 | 0.000000 |
We observe 4 variables with missing values. We will drop race from the dataset, since it is sensitive data. For the rest, we will treat missing values as another outcome of the variable.
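The treatment just described can be sketched on a hypothetical toy frame (the column names mirror the real dataset; the data itself is made up for illustration):

```python
import pandas as pd

# Toy frame illustrating the planned treatment: drop the sensitive
# 'race' column and recode missing categories as 'Other'.
toy = pd.DataFrame({
    'race':   ['black', None, 'white'],
    'gender': ['man', None, 'woman'],
    'zone':   [None, 'africa', 'capital'],
})
toy = toy.drop(columns=['race'])
toy[['gender', 'zone']] = toy[['gender', 'zone']].fillna('Other')
```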
For the study of outliers, we first select the numeric variables (int or float) and then compute the range of each one, to understand which ones may contain outliers.
df_numeric = df[['step', 'amount', 'connection_time','oldbalanceOrg', 'age', 'newbalanceOrig',
'user_number', 'user_connections', 'security_alert', 'oldbalanceDest','newbalanceDest']]
pd.DataFrame({'min': df_numeric.min(), 'max': df_numeric.max()}).T
| step | amount | connection_time | oldbalanceOrg | age | newbalanceOrig | user_number | user_connections | security_alert | oldbalanceDest | newbalanceDest | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 1.0 | 0.1 | 2.618800e-08 | 0.00 | 5.0 | 0.00 | 59.0 | 1.0 | 0.0 | 0.00 | 0.00 |
| max | 95.0 | 10000000.0 | 9.999991e-01 | 38939424.03 | 100.0 | 38946233.02 | 5000.0 | 10.0 | 1.0 | 42054659.73 | 42169156.09 |
We can foresee five variables with possible outliers: amount and the old/new balances of Origin and Destination.
In order to decide what can be called an outlier, we will use the get_deviation_of_mean_perc function with a multiplier of 2, flagging the values outside [Q1 - 2·IQR, Q3 + 2·IQR] for each of these variables.
list_continuous_variables = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
'oldbalanceDest', 'newbalanceDest']
from functions import get_deviation_of_mean_perc
get_deviation_of_mean_perc(df, list_continuous_variables, target='isFraud', multiplier=2)
| 0.0 | 1.0 | variable | sum_outlier_values | porcentaje_sum_null_values | |
|---|---|---|---|---|---|
| 0 | 0.986908 | 0.013092 | amount | 32081 | 0.030595 |
| 1 | 0.998938 | 0.001062 | oldbalanceOrg | 45203 | 0.043109 |
| 2 | 0.999934 | 0.000066 | newbalanceOrig | 45698 | 0.043581 |
| 3 | 0.999480 | 0.000520 | oldbalanceDest | 40402 | 0.038530 |
| 4 | 0.998491 | 0.001509 | newbalanceDest | 41751 | 0.039817 |
The only variable whose outlier values show a greater proportion of fraud than the dataset average is amount. Given its nature, we will keep this data, since it provides useful insight into when fraud is more likely.
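The get_deviation_of_mean_perc helper lives in the auxiliary functions module and is not shown here; the flagging rule it applies (values outside Q1 - 2·IQR and Q3 + 2·IQR) can be sketched as follows. This is a minimal reconstruction under the assumptions above, not the module's actual code:

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, multiplier: float = 2.0) -> pd.Series:
    """Flag values outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - multiplier * iqr) | (series > q3 + multiplier * iqr)

# Toy check: only the extreme value at the end is flagged
s = pd.Series([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(s)
```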
plt.figure(figsize=(16, 8))
sns.boxplot( data = df, y = 'amount', x = 'isFraud', palette="icefire")
plt.title('amount vs Fraud',fontweight="bold", size=20)
Text(0.5, 1.0, 'amount vs Fraud')
For easier visualization, we will rearrange the variables in the dataset, placing the balances and amount at the end (but before the target variable isFraud). As stated before, in order to comply with the GDPR, we will not include the race variable.
df = df[['step','type', 'gender', 'device', 'connection_time','age', 'zone', 'user_number',
'user_connections', 'security_alert', 'nameOrig', 'nameDest', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
'oldbalanceDest', 'newbalanceDest', 'isFraud']]
df.head()
| step | type | gender | device | connection_time | age | zone | user_number | user_connections | security_alert | nameOrig | nameDest | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | man | mac | 0.140039 | 85 | capital | 138 | 5 | 1 | C1231006815 | M1979787155 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 |
| 1 | 1 | PAYMENT | woman | mac | 0.496890 | 57 | country | 909 | 1 | 0 | C1666544295 | M2044282225 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 |
| 2 | 1 | TRANSFER | man | pc | 0.781150 | 66 | capital | 2569 | 10 | 0 | C1305486145 | C553264065 | 181.00 | 181.0 | 0.00 | 0.0 | 0.0 | 1 |
| 3 | 1 | CASH_OUT | man | mac | 0.565068 | 31 | country | 1787 | 3 | 0 | C840083671 | C38997010 | 181.00 | 181.0 | 0.00 | 21182.0 | 0.0 | 1 |
| 4 | 1 | PAYMENT | unknow | mac | 0.517114 | 90 | country | 3997 | 8 | 0 | C2048537720 | M1230701703 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 |
First, in order to understand the correlation between variables, we will plot a heatmap of the Spearman correlations between them. For that we call the get_corr_matrix function from the auxiliary functions module; the function only considers numerical variables. Afterwards, we will compute Cramer's V for all categorical variables (whether their values are numeric or string).
from functions import get_corr_matrix
get_corr_matrix(dataset = df, size_figure = [10,8])
df.corr(method = 'spearman')
| step | connection_time | age | user_number | user_connections | security_alert | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| step | 1.000000 | 0.000528 | -0.000180 | -0.000666 | 0.000129 | -0.000862 | -0.035927 | -0.023625 | -0.022146 | 0.007136 | -0.001575 | 0.025968 |
| connection_time | 0.000528 | 1.000000 | 0.001739 | -0.000451 | 0.000466 | -0.519719 | -0.000075 | -0.000448 | 0.000088 | -0.001316 | -0.001251 | 0.000291 |
| age | -0.000180 | 0.001739 | 1.000000 | 0.001168 | 0.000635 | 0.000363 | 0.000168 | 0.000471 | -0.000734 | -0.000476 | 0.000243 | 0.000551 |
| user_number | -0.000666 | -0.000451 | 0.001168 | 1.000000 | -0.000555 | -0.001605 | 0.000717 | 0.000491 | -0.000147 | 0.000218 | 0.000799 | -0.000821 |
| user_connections | 0.000129 | 0.000466 | 0.000635 | -0.000555 | 1.000000 | -0.000040 | 0.000679 | -0.000732 | -0.001120 | 0.000716 | 0.001114 | 0.000764 |
| security_alert | -0.000862 | -0.519719 | 0.000363 | -0.001605 | -0.000040 | 1.000000 | 0.000543 | 0.000216 | 0.000236 | 0.001188 | 0.001610 | 0.000265 |
| amount | -0.035927 | -0.000075 | 0.000168 | 0.000717 | 0.000679 | 0.000543 | 1.000000 | 0.030319 | -0.092714 | 0.603029 | 0.671973 | 0.028488 |
| oldbalanceOrg | -0.023625 | -0.000448 | 0.000471 | 0.000491 | -0.000732 | 0.000216 | 0.030319 | 1.000000 | 0.814880 | 0.009227 | -0.026446 | 0.030924 |
| newbalanceOrig | -0.022146 | 0.000088 | -0.000734 | -0.000147 | -0.001120 | 0.000236 | -0.092714 | 0.814880 | 1.000000 | 0.021501 | -0.112781 | -0.027402 |
| oldbalanceDest | 0.007136 | -0.001316 | -0.000476 | 0.000218 | 0.000716 | 0.001188 | 0.603029 | 0.009227 | 0.021501 | 1.000000 | 0.924923 | -0.015825 |
| newbalanceDest | -0.001575 | -0.001251 | 0.000243 | 0.000799 | 0.001114 | 0.001610 | 0.671973 | -0.026446 | -0.112781 | 0.924923 | 1.000000 | -0.005359 |
| isFraud | 0.025968 | 0.000291 | 0.000551 | -0.000821 | 0.000764 | 0.000265 | 0.028488 | 0.030924 | -0.027402 | -0.015825 | -0.005359 | 1.000000 |
Naturally, we see a very high correlation between the old and new Origin balances, as well as between the old and new Destination balances. To a lesser but still significant degree, both Destination balance variables are correlated with amount, also as expected. In contrast, it is remarkable that the correlations of the Origin balances with isFraud are practically 0, and even slightly negative for newbalanceOrig.
On the other hand, we do find a significant inverse correlation (negative values) between security alert and connection time, which is not what one would expect. A priori, one would think that the transactions with longer connection times would be those that trigger a security alert (a value of 1), but the negative correlation means the opposite.
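The negative relationship can be double-checked with a simple groupby of mean connection_time per security_alert level. The sketch below uses a hypothetical toy frame (the real df is loaded from the CSV) whose values mimic the negative association:

```python
import pandas as pd

# Toy illustration: with a negative association, transactions with
# security_alert == 1 show a lower average connection_time.
toy = pd.DataFrame({
    'security_alert':  [0, 0, 0, 1, 1, 1],
    'connection_time': [0.9, 0.8, 0.7, 0.2, 0.3, 0.1],
})
means = toy.groupby('security_alert')['connection_time'].mean()
```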
First we define the categorical variables (excluding the target variable)
df_categorical_variables = df[["type", "gender", "device", "zone", "security_alert"]]
df_categorical_variables.head()
| type | gender | device | zone | security_alert | |
|---|---|---|---|---|---|
| 0 | PAYMENT | man | mac | capital | 1 |
| 1 | PAYMENT | woman | mac | country | 0 |
| 2 | TRANSFER | man | pc | capital | 0 |
| 3 | CASH_OUT | man | mac | country | 0 |
| 4 | PAYMENT | unknow | mac | country | 0 |
Then, we call the cramers_v function from the auxiliary functions module and plot its heatmap.
from functions import cramers_v
rows = []
for var1 in df_categorical_variables:
    col = []
    for var2 in df_categorical_variables:
        cramers = cramers_v(df_categorical_variables[var1], df_categorical_variables[var2])  # Cramer's V test
        col.append(round(cramers, 2))  # keep the rounded Cramer's V value
    rows.append(col)
cramers_results = np.array(rows)
df_vcramer = pd.DataFrame(cramers_results, columns=df_categorical_variables.columns,
                          index=df_categorical_variables.columns)
sns.heatmap(df_vcramer, vmin=0, vmax=1,center = 0, square = True,
linewidths = .5, cmap = 'coolwarm')
<AxesSubplot:>
df_vcramer
| type | gender | device | zone | security_alert | |
|---|---|---|---|---|---|
| type | 1.0 | 0.0 | 0.00 | 0.0 | 0.00 |
| gender | 0.0 | 1.0 | 0.00 | 0.0 | 0.00 |
| device | 0.0 | 0.0 | 1.00 | 0.0 | 0.57 |
| zone | 0.0 | 0.0 | 0.00 | 1.0 | 0.00 |
| security_alert | 0.0 | 0.0 | 0.57 | 0.0 | 1.00 |
There is only one significant association, between device and security_alert (0.57).
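The cramers_v helper itself is defined in the (unshown) functions module. A minimal bias-corrected version, reconstructed here as an assumption about what that helper likely does, could look like this:

```python
import numpy as np
import pandas as pd

def cramers_v_sketch(x, y):
    """Bias-corrected Cramer's V between two categorical series.
    An assumed reconstruction, not the module's actual code."""
    confusion = pd.crosstab(x, y).values
    n = confusion.sum()
    row = confusion.sum(axis=1, keepdims=True)
    col = confusion.sum(axis=0, keepdims=True)
    expected = row @ col / n                       # expected counts under independence
    chi2 = ((confusion - expected) ** 2 / expected).sum()
    r, k = confusion.shape
    phi2 = chi2 / n
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))
```

A perfectly associated pair (a variable against itself) yields a value of 1, while independent variables yield values near 0.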
Finally, we compute a Pearson correlation between the continuous numerical variables, in case it provides any additional information beyond the Spearman correlation. For this, as before, we build a new dataframe with only the continuous numerical variables.
df_continous_vairables = df [['step','connection_time','age', 'user_number',
'user_connections', 'amount','oldbalanceOrg',
'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
df_continous_vairables.head()
| step | connection_time | age | user_number | user_connections | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.140039 | 85 | 138 | 5 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 |
| 1 | 1 | 0.496890 | 57 | 909 | 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 |
| 2 | 1 | 0.781150 | 66 | 2569 | 10 | 181.00 | 181.0 | 0.00 | 0.0 | 0.0 |
| 3 | 1 | 0.565068 | 31 | 1787 | 3 | 181.00 | 181.0 | 0.00 | 21182.0 | 0.0 |
| 4 | 1 | 0.517114 | 90 | 3997 | 8 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 |
As with the Spearman correlation, we call get_corr_matrix from the auxiliary functions module, but this time we set the "metodo" (method) parameter to pearson.
get_corr_matrix(dataset = df_continous_vairables,
metodo = 'pearson', size_figure = [10,8])
df_continous_vairables.corr(method = 'pearson')
| step | connection_time | age | user_number | user_connections | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | |
|---|---|---|---|---|---|---|---|---|---|---|
| step | 1.000000 | 0.001068 | 0.000201 | -0.000356 | -0.000140 | -0.025996 | -0.006782 | -0.007182 | -0.002251 | -0.019502 |
| connection_time | 0.001068 | 1.000000 | 0.001739 | -0.000450 | 0.000466 | 0.001755 | -0.001279 | -0.001342 | -0.000199 | -0.000029 |
| age | 0.000201 | 0.001739 | 1.000000 | 0.001168 | 0.000634 | 0.000076 | 0.000155 | 0.000119 | 0.000441 | 0.000377 |
| user_number | -0.000356 | -0.000450 | 0.001168 | 1.000000 | -0.000554 | 0.001476 | 0.000379 | 0.000351 | -0.001038 | -0.000588 |
| user_connections | -0.000140 | 0.000466 | 0.000634 | -0.000554 | 1.000000 | 0.000579 | -0.001799 | -0.001844 | 0.000902 | 0.001160 |
| amount | -0.025996 | 0.001755 | 0.000076 | 0.001476 | 0.000579 | 1.000000 | 0.004864 | -0.001132 | 0.215562 | 0.311942 |
| oldbalanceOrg | -0.006782 | -0.001279 | 0.000155 | 0.000379 | -0.001799 | 0.004864 | 1.000000 | 0.999050 | 0.093305 | 0.064050 |
| newbalanceOrig | -0.007182 | -0.001342 | 0.000119 | 0.000351 | -0.001844 | -0.001132 | 0.999050 | 1.000000 | 0.095179 | 0.063722 |
| oldbalanceDest | -0.002251 | -0.000199 | 0.000441 | -0.001038 | 0.000902 | 0.215562 | 0.093305 | 0.095179 | 1.000000 | 0.978401 |
| newbalanceDest | -0.019502 | -0.000029 | 0.000377 | -0.000588 | 0.001160 | 0.311942 | 0.064050 | 0.063722 | 0.978401 | 1.000000 |
We obtain results similar to the Spearman correlation: only amount and the balance variables are meaningfully correlated.
Since some variables contain NAs and, as stated at the beginning, these null values will be treated as another outcome of the variable, for plotting purposes they will be replaced with "Other".
na_values = {'gender': 'Other', 'device': 'Other', 'zone': 'Other'}
df = df.fillna(value=na_values)
df[df["isFraud"] == 0].hist(column="step", color="darkolivegreen", bins=95)
plt.xlabel("Step (unit = 1 hour)")
plt.ylabel("Number of Transactions")
plt.title("Number of Valid Transactions throughout time")
df[df["isFraud"] == 1].hist(column="step", color="darkred", bins=95)
plt.xlabel("Step (unit = 1 hour)")
plt.ylabel("Number of Transactions")
plt.title("Number of Fraudulent Transactions throughout time")
plt.tight_layout()
plt.show()
The first thing to note is that, as expected, the scales are completely different: the maximum for valid transactions exceeds 50,000 transactions per hour, while for fraudulent ones it is below 25. Despite this, the small number of valid transactions around hours 50 and 90 (if any) is notable, while fraudulent transactions remain at similar levels during those times.
df[df["isFraud"] == 0].hist(column="connection_time", color="darkolivegreen", bins=50)
plt.xlabel("Connection time (from 0 to 1 second)")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on connection time")
df[df["isFraud"] == 1].hist(column="connection_time", color="darkred", bins=50)
plt.xlabel("Connection time (from 0 to 1 second)")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on connection time")
plt.tight_layout()
plt.show()
Again, bearing the scale in mind, connection time is in fact less insightful than step. Nothing notable.
df[df["isFraud"] == 0].hist(column="amount", color="darkolivegreen", bins=50)
plt.xlabel("Transaction amount")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on amount")
df[df["isFraud"] == 1].hist(column="amount", color="darkred", bins=50)
plt.xlabel("Transaction amount")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on amount")
plt.tight_layout()
plt.show()
Here we can confirm what was stated in the outlier study: the proportion of fraudulent transactions increases as the amount gets higher.
df[df["isFraud"] == 0].hist(column="user_number", color="darkolivegreen", bins=50)
plt.xlabel("Number of concurrent users during the transaction")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on concurrent users")
df[df["isFraud"] == 1].hist(column="user_number", color="darkred", bins=50)
plt.xlabel("Number of concurrent users during the transaction")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on concurrent users")
plt.tight_layout()
plt.show()
Bearing the scale in mind, the number of concurrent users is, like connection time, less insightful than step. Nothing notable.
df[df["isFraud"] == 0].hist(column="user_connections", color="darkolivegreen")
plt.xlabel("Number of connections of the user during the day")
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on number of connections of the user")
df[df["isFraud"] == 1].hist(column="user_connections", color="darkred")
plt.xlabel("Number of connections of the user during the day")
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on number of connections of the user")
plt.tight_layout()
plt.show()
A priori, nothing to highlight.
df[df["isFraud"] == 0].type.value_counts().plot(kind='bar')
plt.ylabel("Number of Transactions")
plt.title("Valid transactions depending on transaction type")
plt.show()
df[df["isFraud"] == 1].type.value_counts().plot(kind='bar')
plt.ylabel("Number of Transactions")
plt.title("Fraudulent transactions depending on transaction type")
plt.show()
We can observe that fraud is committed only in CASH_OUT and TRANSFER transaction types. The proportion of fraudulent transactions is also noticeably higher in TRANSFER than in CASH_OUT.
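This observation can be quantified with a crosstab of type against isFraud, normalized per type. The sketch below runs on a hypothetical toy frame whose rows mimic the pattern (fraud only under TRANSFER and CASH_OUT):

```python
import pandas as pd

# Toy frame mimicking the pattern observed in the bar plots
toy = pd.DataFrame({
    'type':    ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'TRANSFER', 'PAYMENT', 'CASH_IN'],
    'isFraud': [0, 1, 1, 0, 0, 0],
})
# Fraud rate per transaction type (each row sums to 1)
rates = pd.crosstab(toy['type'], toy['isFraud'], normalize='index')
```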
The first thing to notice is that the dataset is completely imbalanced, with non-fraud cases making up 99.89% of the total.
Since one of the variables is race, we will drop it and not analyze it, in order to comply with the GDPR.
There are also some outliers, especially in amount, but those values are almost 10 times more likely to be fraud than the dataset as a whole, so we will keep them.
Note that we have not yet analyzed the nameOrig and nameDest variables, since we plan to transform them. Both are composed of two different parts: a letter (C or M) and the client's ID number. This transformation will be performed in the next notebook.
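The planned split of these identifiers can be sketched with pandas vectorized string methods. The new column names (destType, destId) are hypothetical placeholders for the next notebook:

```python
import pandas as pd

# Hypothetical split: leading letter vs. the numeric ID part
toy = pd.DataFrame({'nameDest': ['M1979787155', 'C553264065']})
toy['destType'] = toy['nameDest'].str[0]
toy['destId'] = toy['nameDest'].str[1:]
```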
There are some expected correlations between the balances, as well as between amount and the Destination balances. There are also correlations between device type and security alert, and a negative correlation between connection time and security alert. Since none of these variables represent the same thing, we will not drop any of them from our future models based on correlation.
Highly notable is the step variable and its behaviour with respect to isFraud; so much so that we will create a new variable representing the number of transactions per step, as we have seen that whenever there are fewer transactions the probability of fraud skyrockets.
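Such a per-step count can be attached back to every row with a groupby-transform. The feature name tx_per_step is a hypothetical choice for the next notebook:

```python
import pandas as pd

# Hypothetical feature: number of transactions observed in each step,
# broadcast back to every row via groupby-transform
toy = pd.DataFrame({'step': [1, 1, 1, 2, 2, 3]})
toy['tx_per_step'] = toy.groupby('step')['step'].transform('size')
```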
For transaction type, we have learned that only Transfers and Cash-outs can be fraudulent.